Supporting Similarity Queries in Apache AsterixDB
Authors
Abstract
Many applications require similarity query processing. Most existing work has taken an algorithmic approach, developing indexing structures, algorithms, and/or various optimizations. In this work, we take a different, systems-oriented approach. We describe the support for similarity queries in Apache AsterixDB, a parallel, open-source Big Data management system for NoSQL data. We describe the lifecycle of a similarity query in the system, including the support provided at the query language level, indexing, execution plans (with and without indexes), and plan rewrites to optimize query execution. Our approach leverages the existing infrastructure of AsterixDB, including its operators, parallel query engine, and rule-based query optimizer. We have conducted an experimental study using several large, real data sets on a parallel computing cluster to evaluate AsterixDB’s support for similarity queries, and we share the efficacy and performance results here.
Similar resources
Large-scale Complex Analytics on Semi-structured Datasets using AsterixDB and Spark
Large quantities of raw data are being generated by many different sources in different formats. Private and public sectors alike acclaim the valuable information and insights that can be mined from such data to better understand the dynamics of everyday life, such as traffic, worldwide logistics, and social behavior. For this reason, storing, managing, and analyzing “Big Data” at scale is gett...
AsterixDB: A Scalable, Open Source BDMS
AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today’s open source Big Data ecosystem. Its features make it well-suited to applications like web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a query language that su...
Scalable Fault-Tolerant Data Feeds in AsterixDB
In this paper we describe the support for data feed ingestion in AsterixDB, an open-source Big Data Management System (BDMS) that provides a platform for storage and analysis of large volumes of semi-structured data. Data feeds are a mechanism for having continuous data arrive into a BDMS from external sources and incrementally populate a persisted dataset and associated indexes. The need to pe...
Principles and Applications for Supporting Similarity Queries in Non-Ordered Discrete and Continuous Data Spaces
The SQL++ Query Language: Configurable, Unifying and Semi-structured
NoSQL databases support semi-structured data, typically modeled as JSON. They also provide limited (but expanding) query languages. Their idiomatic, non-SQL language constructs, the many variations, and the lack of formal semantics inhibit deep understanding of the query languages, and also impede progress towards clean, powerful, declarative query languages. This paper specifies the syntax and...